Donald B. Rubin
Abstract

Most articles on missing values assume the missing data are "missing at
random" and ignore the process that "caused" the missing values. The condition
under which this procedure is justified is explored here: the concept of
missing at random is precisely defined, several examples are discussed, and
two simple conditions are given which are sufficient to assure that the missing
data are missing at random.
MISSING AT RANDOM - WHAT DOES IT MEAN?

Donald Rubin 
Educational Testing Service 

1. "Missing at Random" as Used in the Literature

In many articles on missing values there is an assumption, either implicit
or explicit, that the missing data are "missing at random" in the sense that
the process that caused the missing values can be ignored. In some articles,
such as those concerned with the multivariate normal (Afifi & Elashoff, 1966;
Anderson, 1957; Hartley & Hocking, 1971; Hocking & Smith, 1968; Wilks, 1932),
"missing at random" seems to mean that each item in the data matrix is equally
likely to be missing. In other articles, such as those dealing with the
analysis of variance (Hartley, 1956; Healy & Westmacott, 1956; Rubin, 1972;
Wilkinson, 1958), "missing at random" seems to mean that observations of the
dependent variable are missing without regard to the actual values that would
have been observed. Similarly, "missing at random" apparently can mean missing
according to a preplanned experimental design (Hocking & Smith, 1972; Trawinski
& Bargmann, 1964).

The objective here is to explore the specific assumptions that need to be
made in order to ignore the process that caused the missing values when
investigating the density of the data. More specifically, the approach will be to
examine the likelihood function of the observed data and the observed pattern
of missing values and then to specify the condition under which solutions
(e.g., maximum likelihood estimates and sampling distributions, Bayes posterior
distributions) based on this likelihood agree with those based on the marginal
likelihood of the observed data.



2. Notation and a Definition

Let $P_\theta^{V}$ be a probability density function for a real-valued vector
random variable $V$ of order $k$, where $\theta$ is a vector parameter which lies
in an open parameter space $\Omega$. A sample realization of $V$ is the data;
generally $k = pn$ where $p$ = number of "variables" and $n$ = number of "units."
We assume that the data analyst's primary objective is the investigation of
this density (e.g., estimating $\theta$, testing hypotheses about $\theta$, estimating
a posterior density for $\theta$). Let $W$ be a 0-1 indicator random variable of
length $k$, and let $P_\psi^{V,W}$ be the joint probability density function for $V$
and $W$, where $\psi$ is the vector parameter for this density. A sample
realization of $W$ will indicate the missing values in the data. We have,
of course, that $P_\theta^{V} = \int P_\psi^{V,W}$, where $\int$ is the integral over the $W$
random variable. We also define $P_\phi^{W|V} = P_\psi^{V,W} / P_\theta^{V}$ to be the conditional
density of the missing value indicator given the data, where $\phi$ lies in a
parameter space $\Omega_\phi$.

Let $v, w$ be a sample realization of $V, W$. If $w_i = 1$, $v_i$ is an
observed scalar random variable and thus is a real number. If $w_i = 0$, $v_i$
is an unobserved scalar random variable. Thus $v$ is composed of $k-m$ real
numbers and $m$ unobserved scalar random variables, where $m$ is the number
of missing values. Let $\mathring{v}$ indicate the $m$-vector of unobserved random
variables in $v$, i.e., the missing data.

The likelihood function of all observables, that is, the indicator
variable and the observed data, is

(1)    $\int_{\mathring{v}} P_\psi^{V,W}(v,w)$

where $P_\psi^{V,W}(v,w)$ is the density of $V,W$ evaluated at the observed values
$v$ and $w$, regarded as a function of the missing data $\mathring{v}$ and the
parameters $\psi$ of the joint density, and $\int_{\mathring{v}}$ represents the integral
over $\mathring{v}$, the unobserved scalar random variables.
This likelihood can also be written as

       $\left[ \int_{\mathring{v}} P_\theta^{V}(v) \right] \left[ \int_{\mathring{v}} P_\psi^{V,W}(v,w) \Big/ \int_{\mathring{v}} P_\theta^{V}(v) \right]$

where

(2)    $\int_{\mathring{v}} P_\theta^{V}(v)$

is the marginal likelihood of the observed data and

(3)    $\int_{\mathring{v}} P_\psi^{V,W}(v,w) \Big/ \int_{\mathring{v}} P_\theta^{V}(v)$

is the conditional likelihood of the missing value indicator given the
observed data.



Definition: The missing data $\mathring{v}$ are said to be missing at random if the
conditional likelihood of the missing value indicator given the observed data,
equation (3), is independent of $\theta$.

The motivation for this definition is that when the data are missing at
random, maximum likelihood estimates of $\theta$ and their sampling distributions
(as well as Bayes posterior densities for $\theta$) obtained from the marginal
likelihood of the observed data, equation (2), agree with those obtained from
the full likelihood of all observables, equation (1). In this sense, if the
data are missing at random, the observed data may be said to be "sufficient"
3. Simple Conditions Sufficient for the Missing Data to be Missing at Random

By rewriting $P_\psi^{V,W}(v,w)$ as $P_\theta^{V}(v) P_\phi^{W|V}(v,w)$ we have that equation (3),
the conditional likelihood of the missing value indicator given the observed
data, can be written as

(4)    $\int_{\mathring{v}} P_\theta^{V}(v) P_\phi^{W|V}(v,w) \Big/ \int_{\mathring{v}} P_\theta^{V}(v)$ .

Clearly, if $P_\phi^{W|V}(v,w)$ is independent of $\theta$ and $\mathring{v}$, equation (4) is
independent of $\theta$; hence, the following result.

Lemma: If (1) $P_\phi^{W|V}(v,w)$ is independent of $\mathring{v}$, the missing data, and
(2) $\theta$ and $\phi$ lie in disjoint parameter spaces,
then the missing data, $\mathring{v}$, are missing at random.
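
The step behind the lemma can be written out in one line. The sketch below
uses only the two conditions of the lemma, with $\mathring{v}$ denoting the missing
data, $\theta$ the parameter of the data density, and $\phi$ the parameter of the
conditional density of the missing value indicator given the data:

```latex
\begin{align*}
\frac{\int_{\mathring{v}} P_\theta^{V}(v)\, P_\phi^{W|V}(v,w)}
     {\int_{\mathring{v}} P_\theta^{V}(v)}
&= P_\phi^{W|V}(v,w)\,
   \frac{\int_{\mathring{v}} P_\theta^{V}(v)}{\int_{\mathring{v}} P_\theta^{V}(v)}
   && \text{by condition (1)} \\
&= P_\phi^{W|V}(v,w) .
\end{align*}
```

Under condition (1) the conditional density does not involve $\mathring{v}$ and so
passes outside the integral. Under condition (2) the factor that remains
involves $\phi$ but not $\theta$, so the conditional likelihood is independent of
$\theta$.
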
The first condition in this lemma is satisfied by all of the examples
given by the references cited in Section 1. "Equally likely" missing values
in the data matrix yield

       $P_\phi^{W|V}(v,w) = \prod_{i=1}^{k} \phi^{w_i} (1 - \phi)^{1 - w_i}$ ,  $w_i$ = 0 or 1 ,

where $\phi$ is the probability of being observed. "Preplanned" missing
observations yield

       $P_\phi^{W|V}(v,w) = \prod_{i=1}^{k} \delta(w_i - \phi_i)$ ,  $w_i$ = 0 or 1 ,
where $\phi$ is the 0-1 vector indicating the preplanned pattern of missing
observations and $\delta(a) = 1$ if $a = 0$ and zero otherwise. "Without regard to values
that would have been observed" simply implies that $P_\phi^{W|V}(v,w)$ is independent
of the missing values $\mathring{v}$. As a more complex example, assume that odd $v_i$
are always observed and even $v_i$ are missing if the preceding value $v_{i-1}$
is greater than $\phi$. Taking the first product over odd $i$ and the second over
even $i$ ($i = 2, \ldots, k$), we have

       $P_\phi^{W|V}(v,w) = \prod_{i\ \mathrm{odd}} \delta(1 - w_i) \prod_{i\ \mathrm{even}} \gamma(w_i, v_{i-1} - \phi)$

where

       $\gamma(a,b) = 1$ if $a = 0$ and $b > 0$, or $a = 1$ and $b \le 0$;
       $\gamma(a,b) = 0$ otherwise.
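
The three missingness mechanisms just described can be sketched as small
functions that return the conditional probability of an indicator vector w
given the data v. This is an illustrative sketch only: the function names,
the list representation, and the example numbers are assumptions, not part
of the original text. Items are numbered from 1 as in the text, so list
position 0 holds item 1.

```python
from math import prod


def p_equally_likely(w, phi):
    """P(w | v) when each item is observed independently with probability phi."""
    return prod(phi if wi == 1 else 1.0 - phi for wi in w)


def p_preplanned(w, pattern):
    """P(w | v) when the 0-1 pattern of observed items is fixed in advance."""
    return 1.0 if list(w) == list(pattern) else 0.0


def p_threshold(w, v, phi):
    """P(w | v) when odd items are always observed and an even item is
    missing exactly when the preceding (odd) value exceeds phi."""
    p = 1.0
    for idx, wi in enumerate(w):
        item = idx + 1  # 1-based item number, as in the text
        if item % 2 == 1:
            # delta(1 - w_i): odd items must be observed
            p *= 1.0 if wi == 1 else 0.0
        else:
            # gamma(w_i, v_{i-1} - phi): uses only the preceding odd value
            b = v[idx - 1] - phi
            consistent = (wi == 0 and b > 0) or (wi == 1 and b <= 0)
            p *= 1.0 if consistent else 0.0
    return p


# Item 2 is missing because item 1 exceeds phi; item 4 is observed.
v = [0.9, 5.0, 0.2, 7.0]
w = [1, 0, 1, 1]
print(p_threshold(w, v, phi=0.5))
# Changing the value that was never observed leaves the probability unchanged.
print(p_threshold(w, [0.9, -100.0, 0.2, 7.0], phi=0.5))
```

None of the three functions reads the value of an item whose indicator is
0, which is condition (1) of the lemma stated in code: the probability of
the pattern does not depend on the missing values.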



If in these examples $\theta$ and $\phi$ lie in disjoint parameter spaces, both
conditions in the lemma are satisfied and the missing data will be missing at
random. If condition (1) in the lemma is satisfied but condition (2) is not,
it is clear from equation (4) that the data are not missing at random;
nevertheless, maximum likelihood and Bayes procedures applied to the marginal
likelihood of the observed data $\int_{\mathring{v}} P_\theta^{V}(v)$ are "reasonable" (e.g., consistent)
and suffer only from reduced "efficiency". Thus, in a sense, condition (1)
in the lemma might have been chosen as the definition of missing at random.
However, then discussion of maximum likelihood and Bayes procedures following
an assumption of missing at random would always be somewhat imprecise and
inaccurate.

An argument could be made for choosing conditions (1) and (2) of the
lemma as the definition of missing at random because models not satisfying
condition (1) intuitively do not seem to have missing data missing at random.
For example, assume the data for odd $i$ are uniform on $(0, \theta)$, and the
data for even $i$ are uniform on $(0, 1)$ and missing if less than $\phi$
($\theta$ and $\phi$ lie in disjoint parameter spaces); then by equation (3) the
data are missing at random even though condition (1) is not satisfied.
Nevertheless, if the phrase "missing at random" is meant to imply that the
process that caused the missing values, whatever it may be, can be ignored,
the definition of missing at random given here in Section 2 is appropriate.
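
The uniform example can be checked term by term against equation (3). The
sketch below introduces notation that is not in the original: $E_{\mathrm{obs}}$
and $E_{\mathrm{mis}}$ are the even indices whose values are observed and missing,
$m$ is the number of missing values, $\mathbf{1}[\cdot]$ is an indicator
function, and $0 < \phi < 1$ is assumed.

```latex
\begin{align*}
\text{numerator of (3):}\quad
& \prod_{i\ \mathrm{odd}} \theta^{-1}\,\mathbf{1}[0 < v_i < \theta]
  \prod_{i \in E_{\mathrm{obs}}} \mathbf{1}[\phi \le v_i < 1]
  \prod_{i \in E_{\mathrm{mis}}} \int_0^{\phi} dv_i \\
\text{denominator of (3):}\quad
& \prod_{i\ \mathrm{odd}} \theta^{-1}\,\mathbf{1}[0 < v_i < \theta]
  \prod_{i \in E_{\mathrm{obs}}} \mathbf{1}[0 < v_i < 1]
  \prod_{i \in E_{\mathrm{mis}}} \int_0^{1} dv_i \\
\text{ratio:}\quad
& \phi^{m} \prod_{i \in E_{\mathrm{obs}}} \mathbf{1}[v_i \ge \phi]
\end{align*}
```

The factors involving $\theta$ cancel, so the ratio depends on $\phi$ alone. This
is the sense in which the data are missing at random here, even though
whether an even item is missing depends on its own unobserved value.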

4. Examples

As a practical missing values problem consider the problem of nonresponse
in sample surveys, where the parameters $\theta$ are the parameters of the joint
distribution of response variables and background variables. Assume the
nonrespondents are known to be typically different from the respondents,
say, to have lower socioeconomic status (SES). Are the data missing at random?
Assume the researcher has recorded a measure of SES as well as other
potentially relevant background variables for all subjects. If, conditionally
given these observed background variables, a subject will offer or not offer his
response independently of what that response would be, that is, if subjects
with identical background variables (but possibly different responses) are
equally likely to respond, then condition (1) in the lemma is satisfied; if,
in addition, the parameters of the nonresponse process are independent of $\theta$,
the missing data are missing at random. Hence by collecting "additional"
variables the researcher can often make the assumption of missing at random
plausible.
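
A small simulation may make the survey argument concrete. It is a sketch
under invented assumptions, not an analysis from the paper: the two SES
groups, their mean responses (40 and 60), the response rates (0.3 and 0.9),
and all names are hypothetical. Within each SES group the decision to
respond is independent of the response itself, as condition (1) requires.

```python
import random


def simulate(n=20000, seed=0):
    """Return (low_ses, response, observed) triples for n hypothetical subjects."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        low_ses = rng.random() < 0.5
        # The response depends on SES (assumed group means 40 and 60).
        response = rng.gauss(40.0 if low_ses else 60.0, 5.0)
        # Nonresponse depends on SES only, not on the response itself.
        p_respond = 0.3 if low_ses else 0.9
        observed = rng.random() < p_respond
        rows.append((low_ses, response, observed))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate()

# Mean over all subjects, available only because this is a simulation.
true_mean = mean(r for _, r, _ in rows)

# Mean over respondents alone, ignoring the recorded background variable.
naive_mean = mean(r for _, r, obs in rows if obs)

# Respondent mean within each SES group, weighted by the full group size.
adjusted_mean = 0.0
for group_flag in (True, False):
    group = [row for row in rows if row[0] == group_flag]
    respondent_mean = mean(r for _, r, obs in group if obs)
    adjusted_mean += respondent_mean * len(group) / len(rows)

print(true_mean, naive_mean, adjusted_mean)
```

In this sketch the respondent-only mean overstates the overall mean because
high-SES subjects respond more often, while the estimate that conditions on
the recorded SES variable stays close to it. The recorded background
variable is what makes ignoring the nonresponse process defensible.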

However, even if the missing data are missing at random, the researcher's
problem in choosing an appropriate model may be more serious than it would be
if there were no missing data. For example, if the regression of response
variables on background variables is curvilinear, and there are many missing
responses when the values of the background variables are extreme (e.g., low
SES), fitting a linear model may yield especially poor prediction of the
typical responses for those subjects likely to have missing responses.



As another example of missing data consider nonresponse on multiple
choice questionnaires. Lord (1973) makes the distinction between "not
reached" items, which the examinee did not have time to attempt, and "omitted"
items, which the examinee reached, presumably read, but did not answer. Here
$\theta$ includes the parameters of subject ability and item difficulty. If the items
on the test are not ordered with respect to difficulty, it seems reasonable
to assume, as does Lord, that condition (1) in the lemma holds for the
not-reached items but does not hold for the omitted items; that is, $P_\phi^{W|V}(v,w)$
is independent of the $\mathring{v}$ corresponding to the not-reached items but does
depend upon the $\mathring{v}$ corresponding to the omitted items. However, it also
seems fairly clear that the parameters $\theta$ and $\phi$ may not lie in disjoint
parameter spaces since more intelligent examinees probably reach more items
and omit a lower proportion of items reached. Assuming that the number
of items reached does not depend upon $\theta$, then the not-reached items are
missing at random.

The investigation of complex models for nonrandom missing values such as
might be appropriate for Lord's data set is a relatively unexplored area of
statistics. Only a few "censored-data" models are commonly available for
dealing with nonrandomly missing data (e.g., see Swan, 1969).